Bayesian clustering of multiple zero-inflated outcomes

Several applications involving counts present a large proportion of zeros (excess-of-zeros data). A popular model for such data is the hurdle model, which explicitly models the probability of a zero count, while assuming a sampling distribution on the positive integers. We consider data from multiple count processes. In this context, it is of interest to study the patterns of counts and cluster the subjects accordingly. We introduce a novel Bayesian approach to cluster multiple, possibly related, zero-inflated processes. We propose a joint model for zero-inflated counts, specifying a hurdle model for each process with a shifted Negative Binomial sampling distribution. Conditionally on the model parameters, the different processes are assumed independent, leading to a substantial reduction in the number of parameters as compared with traditional multivariate approaches. The subject-specific probabilities of zero-inflation and the parameters of the sampling distribution are flexibly modelled via an enriched finite mixture with random number of components. This induces a two-level clustering of the subjects based on the zero/non-zero patterns (outer clustering) and on the sampling distribution (inner clustering). Posterior inference is performed through tailored Markov chain Monte Carlo schemes. We demonstrate the proposed approach on an application involving the use of the messaging service WhatsApp. This article is part of the theme issue ‘Bayesian inference: challenges, perspectives, and prospects’.


Introduction
Count data presenting an excess of zeros are commonly encountered in applications. These can arise in several settings, such as healthcare, medicine or sociology. In this scenario, the observations carry structural information about the data-generating process, i.e. an inflation of zeros. The analysis of zero-inflated data requires the specification of models beyond standard count distributions, such as the Poisson or the negative binomial. Commonly used models are the zero-inflated [1], the hurdle [2] and the zero-altered [3] models. The first class assumes the existence of a probability mass at zero and a distribution over $\mathbb{N}_0 = \{0, 1, 2, \dots\}$. This type of model explicitly differentiates between the zeros originating from a common underlying process, such as the utilization of a service, described by the sampling distribution on $\mathbb{N}_0$, and those arising from a structural phenomenon, such as ineligibility to use the service, which are modelled by the point mass. Very popular zero-inflated models are the zero-inflated Poisson (ZIP) and the zero-inflated negative binomial (ZINB) models, where the sampling distribution is chosen to be a Poisson and a negative binomial, respectively. These models allow for inflation in the number of zeros and departures from standard distributional assumptions on the moments of the sampling distribution. For instance, the ZIP model allows the mean and the variance of the distribution to differ (as opposed to a standard Poisson distribution), while the ZINB additionally captures overdispersion in the data.
Hurdle models are a very popular choice of distributions for modelling zero-inflated counts. Differently from the zero-inflated ones, these models handle zeros and positive observations separately, assuming on the latter a sampling distribution with support on $\mathbb{N} = \mathbb{N}_0 \setminus \{0\}$. Thus the distribution of the count data is given by
$$ f(y_i \mid p_i, \mu_i) = (1 - p_i)\,\mathbb{1}(y_i = 0) + p_i\, g(y_i \mid \mu_i)\,\mathbb{1}(y_i > 0), $$
where $p_i$ and $g$ now capture two distinct features of the data. Hurdle models present appealing features that can make them preferable to zero-inflated models. Firstly, hurdle distributions allow for both inflation and deflation of zero counts. Indeed, under a zero-inflated model, the probability of observing a zero is always greater than the corresponding probability under the sampling distribution, thus making it impossible to capture deflation in the number of zeros [4]. Secondly, and more importantly for our work, the probability of zero counts in hurdle models is independent of the parameters controlling the distribution of non-zero counts. This feature improves interpretability and facilitates parameter estimation. Note that the zero-altered model proposed by Heilbron [3] is a modified hurdle model in which the two parts are connected by specifying a direct link between the model parameters.
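To make the contrast concrete, compare the zero probabilities under the two formulations (a standard observation, stated here in our notation, with $\pi$ denoting the zero-inflation weight):
$$ \text{zero-inflated:}\quad \Pr(Y = 0) = \pi + (1 - \pi)\, g(0 \mid \mu) \;\ge\; g(0 \mid \mu); \qquad \text{hurdle:}\quad \Pr(Y = 0) = 1 - p. $$
Under the zero-inflated model the zero probability can never fall below that of the sampling distribution, whereas under the hurdle model $1 - p$ is a free parameter in $(0, 1)$ and can accommodate both inflation and deflation of zeros.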
Univariate models for zero-inflated data can be extended to multivariate settings, where several variables presenting excess of zeros are recorded, e.g. in applications involving questionnaires or microbiome data analysis. In this context, a multivariate extension of the ZIP model has been proposed by Li et al. [5], through a finite mixture with ZIP marginals. In this construction, the number of parameters increases linearly as the number d of zero-inflated processes increases, as the total number of parameters is 3d + 2. See also Liu et al. [6,7] and Tian et al. [8] for simplified versions of the previous construction involving a smaller number of parameters and better distributional properties.
In a Bayesian parametric setting, Fox [9] proposes the joint modelling of two related zero-inflated outcomes. Their strategy is based on the ZIP model, with the same Bernoulli component to capture the extra zeros for both processes. Correlation between subject-specific outcomes is accounted for through the specification of a joint random effect distribution for the parameters governing the sampling distribution of the two processes. Alternatively, Lee et al. [10] model the binary variables indicating whether an observation is positive or not via a multivariate probit model [11,12]. In this approach, the vectors of latent continuous variables characterizing the multivariate probit are modelled jointly assuming a random unstructured correlation matrix describing their dependence.
In several applications, knowledge relative to the grouping of the subjects is also available, thus providing additional information that can be exploited in the model [13]. Moreover, the clustering structure can be estimated by assuming a prior distribution on the partition of the subjects, e.g. via the popular Dirichlet process [14] or a mixture with a random number of components, as proposed by Hu et al. [15]. In the context of Bayesian semiparametric approaches, Shuler et al. [16] propose to model multivariate zero-inflated count data by linking different Dirichlet process mixtures of ZINB models through the popular dependent Dirichlet process (DDP) [17]. In particular, the probability of zeros and the sampling distribution are modelled via two distinct single-p DDPs, where the location parameters of the mixture depend on a categorical covariate. This approach yields flexible estimation of the partition of the subjects, although it does not allow for sharing of information a priori between the two components of the ZINB model, thus yielding two separate clustering structures. A different semiparametric approach is proposed by Arab et al. [18], who exploit the multivariate ZIP construction of Li et al. [5] to model bivariate count data, but model the proportion of zeros and the intensity of the sampling distribution through spline regression terms. The spline approach is flexible and computationally tractable when d is small; for larger dimensions, it would induce a non-trivial computational burden.
The focus of this work is clustering of individuals based on multiple, possibly related, zero-inflated processes. To this end, we propose a Bayesian approach for the joint modelling of zero-inflated count data, based on finite mixtures with a random number of components. In particular, we specify a hurdle model for each process with a shifted negative binomial sampling distribution on the positive integers. Let $n$ denote the sample size and $d$ the number of processes under study. The subject-specific probabilities of zero-inflation $p_{ij}$, for the $i$th individual and the $j$th process, $i = 1, \dots, n$, $j = 1, \dots, d$, and the parameter vectors of the sampling distribution $\mu_{ij}$ are flexibly modelled via an enriched mixture with a random number of components, borrowing ideas from the Bayesian non-parametric literature on the Dirichlet process. One of the main novelties of our work is to combine a recent representation of finite mixture models with a random number of components presented in Argiento & De Iorio [19] with a finite extension of the enriched non-parametric prior proposed by Wade et al. [20] to achieve a two-level clustering of the subjects: at the outer level, individuals are clustered based on the pattern of zero/non-zero observations, while within each outer cluster they are grouped at a finer level (which we refer to as the inner level) according to the distribution of the non-zero counts. Figure 1 provides an illustration of the nested clustering structure.
Enriched priors in Bayesian non-parametrics generalize concepts developed by Consonni & Veronese [21], who propose a general methodology for the construction of enriched conjugate families for the parametric natural exponential families. The idea underlying this approach is to decompose the joint prior distribution for a vector of parameters indexing a multivariate exponential family into tractable conditional distributions. In particular, distributions belonging to the multivariate natural exponential family satisfy the conditional reducibility property, which allows reparameterizing the distribution in terms of a parameter vector, whose components are variation and likelihood independent. Then, it is possible to construct an enriched standard conjugate family on the parameter vector, closed under i.i.d. sampling, which leads to the breaking down of the global inference procedure into several independent subcomponents. Such parameterization achieves greater flexibility in prior specification relative to the standard conjugate one, while still allowing for efficient computations (see, for example, [22]). An example of this class of parametric priors is the enriched Dirichlet distribution [23].
In a Bayesian non-parametric framework, Wade et al. [20] first propose an enrichment of the Dirichlet process [24] that is more flexible with respect to the precision parameter but still conjugate, by defining a joint random probability measure on the measurable product space $(\mathcal{X}, \mathcal{Y})$ in terms of the marginal and conditional distributions, $P_X$ and $P_{Y \mid X}$, and assigning independent Dirichlet process priors to each of these terms. The enriched Dirichlet process enables a nested clustering structure that is particularly appealing in our setting and allows for finer control of the dependence structure between $X$ and $Y$. This construction has also been employed in non-parametric regression problems to model the joint distribution of the response and the covariates [25,26], as well as in longitudinal data analysis [27] and causal inference [28].
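Schematically, and suppressing hyperparameters, the enriched construction can be written as follows (our notational sketch of [20]):
$$ P(A \times B) = \int_A P_{Y \mid X}(B \mid x)\, P_X(\mathrm{d}x), \qquad P_X \sim \mathrm{DP}(\alpha_X, G_X), \qquad P_{Y \mid X}(\cdot \mid x) \overset{\text{ind}}{\sim} \mathrm{DP}\big(\alpha_Y(x), G_{Y \mid X}(\cdot \mid x)\big), $$
with the conditional processes independent across $x$ and independent of $P_X$, so that the marginal and conditional components can be updated separately.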
Recently, Rigon et al. [29] proposed the enriched Pitman–Yor process, which yields more robust clustering estimation.
In this work, we consider the joint distribution of $d$ zero-inflated processes, where the $d$-dimensional vectors of probabilities $(p_{i1}, \dots, p_{id})$ correspond to $X$, while the parameters of the sampling distributions $\mu_{ij}$ correspond to $Y$. The enrichment of the prior is achieved by modelling both $P_X$ and $P_{Y \mid X}$ through a mixture with a random number of components (see, for instance, [30]). We exploit the recent construction by Argiento & De Iorio [19] based on normalized independent finite point processes (Norm-IFPP), which allows for a wider choice of prior distributions for the unnormalized weights of the mixture. Therefore, the proposed model offers more flexibility, while preserving computational tractability.
The motivating application for the proposed model is the analysis of multiple count data collected from a questionnaire on the frequency of use of the messaging service WhatsApp [31]. In particular, the questionnaire concerns the sharing of COVID-19-related information via WhatsApp messages, either directly or by forwarding, over the course of a week. For each subject, responses to the same seven questions are recorded over seven consecutive days, providing information on a subject's WhatsApp use (see the electronic supplementary material, Table S1). In this set-up, the multiple count processes correspond to the seven questions, all of which display an excess of zeros (see the electronic supplementary material, Figure S2).
The manuscript is organized as follows. Section 2 introduces a novel enriched prior process for multiple zero-inflated outcomes, while §3 describes the Markov chain Monte Carlo (MCMC) algorithm designed for posterior inference. We demonstrate the model on the WhatsApp application in §4. We conclude the paper in §5.

The model

(a) Likelihood
To take into account the zero-inflated nature of the data, we assume for each outcome $j$ a hurdle model. Each observed count $Y_{ij}$ is equal to zero with probability $1 - p_{ij}$, while with probability $p_{ij}$ it is distributed according to a probability mass function (pmf) $g(\cdot \mid \mu_{ij})$ with support on $\mathbb{N}$. Assuming conditional independence among responses, the likelihood for a subject is given by
$$ f(\mathbf{y}_i \mid \mathbf{p}_i, \boldsymbol{\mu}_i) = \prod_{j=1}^{d} \left[ (1 - p_{ij})\,\mathbb{1}(y_{ij} = 0) + p_{ij}\, g(y_{ij} \mid \mu_{ij})\,\mathbb{1}(y_{ij} > 0) \right]. \tag{2.1} $$
In what follows, we set $g$ to be a shifted negative binomial distribution with parameters $\mu_{ij} = (r_{ij}, \theta_{ij})$ and pmf
$$ g(y \mid r_{ij}, \theta_{ij}) = \binom{y + r_{ij} - 2}{y - 1}\, \theta_{ij}^{r_{ij}} (1 - \theta_{ij})^{y - 1}, \qquad y \in \mathbb{N}, $$
where $r_{ij} \in \mathbb{N}$ and $\theta_{ij} \in (0, 1)$, for $i = 1, \dots, n$ and $j = 1, \dots, d$. Different parametric choices for $g$ are possible (e.g. a shifted Poisson), or even non-parametric alternatives could be employed. Note that the conditional independence assumption among the multiple processes leads to a significant reduction in the number of parameters as compared with multivariate zero-inflated models.
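As an illustration, a minimal R sketch of this kernel (ours, not the authors' code) can rely on the fact that dnbinom(y - 1, size = r, prob = theta) evaluates the negative binomial pmf shifted to the positive integers:

```r
# Sketch of the hurdle likelihood with a shifted negative binomial
# sampling distribution (illustrative; not the authors' code).
dhurdle_snb <- function(y, p, r, theta, log = FALSE) {
  # P(Y = 0) = 1 - p; for y > 0, Y - 1 ~ NegBinom(size = r, prob = theta)
  out <- ifelse(y == 0,
                1 - p,
                p * dnbinom(y - 1, size = r, prob = theta))
  if (log) log(out) else out
}

# Example: pmf at counts 0..5 for one subject/process
dhurdle_snb(0:5, p = 0.3, r = 2, theta = 0.6)
```

The returned values sum to one over $y = 0, 1, 2, \dots$, since the positive part is a proper pmf on $\mathbb{N}$ weighted by $p$.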

(b) Enriched finite mixture model
In this work, we propose an enriched extension of the Norm-IFPP of Argiento & De Iorio [19] and specify a joint prior for $(\mathbf{p}_i, \boldsymbol{\mu}_i)$ as conditionally dependent processes. This allows us to account for inter-individual heterogeneity, overdispersion and outliers, and it induces a data-driven nested clustering of the observations. Each subject is first assigned to an outer cluster and then clustered again at an inner level, providing increased interpretability. Differently from previous work on Bayesian non-parametric enriched processes, we opt for a finite mixture with a random number of components, where the weights are obtained through the normalization of a finite point process.
Finite mixture models with a random number of components have attracted increasing interest in the last years (see, for example, [30,32]). The representation of Argiento & De Iorio [19] allows for the specification of a wide range of distributions for the weights and simultaneously leads to effective and widely applicable MCMC schemes, on which algorithms 1 and 2 are based. More specifically, they show that a finite mixture model is equivalent to a realization of a stochastic process with random dimension and infinite-dimensional support, leading to flexible distributions for the weights of the mixture given by the normalization of a finite point process.

Algorithm 1 (conditional sampler) proceeds, in outline, as follows:

    Compute K, the number of allocated components at the outer level.
    Relabel the outer-level clusters so that the first K components of the mixture are allocated.
    Sample the unnormalized weights of the outer measure from their full conditional.
    for m in 1:K do
        Sample the unnormalized weights of the mth inner mixture from their full conditional,
            where n_ms is the cardinality of inner-level cluster s and n_ms = 0 for s > K_m.
        for s in 1:K_m do
            Sample (r_ms, θ_ms) from the full conditional.
        end for
        for s in (K_m + 1):S_m do
            Sample r_ms from the prior. Sample θ_ms from the prior.
        end for
    end for
    for m in (K + 1):M do
        Sample p_m and S_m from the prior.
        for s in 1:S_m do
            Sample the unnormalized weight from the prior. Sample r_ms from the prior. Sample θ_ms from the prior.
        end for
    end for
We thus employ this approach as it allows for efficient computations via a conditional algorithm, as compared with the labour-intensive reversible jump algorithms common in mixture models. An alternative efficient conditional sampler for mixtures with a random number of components is the recently proposed telescoping sampler [33].
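To fix ideas, the following R sketch simulates the enriched weight structure from the prior, under the choices used in our application (shifted-Poisson priors on $M$ and $S_m$, unit-rate Gamma unnormalized weights; the hyperparameter values are arbitrary):

```r
# Sketch: one prior draw of the enriched mixture weights
# (illustrative; hyperparameter names follow Section 3).
sim_weights <- function(lambda_M = 3, lambda_S = 3, gamma_M = 1, gamma_S = 1) {
  M <- rpois(1, lambda_M) + 1                      # shifted Poisson: M >= 1
  w_out <- rgamma(M, gamma_M, 1)                   # unnormalized outer weights
  w_out <- w_out / sum(w_out)                      # normalize
  w_in <- lapply(seq_len(M), function(m) {
    S_m <- rpois(1, lambda_S) + 1                  # inner components of outer m
    w <- rgamma(S_m, gamma_S, 1)
    w / sum(w)
  })
  list(w_out = w_out, w_in = w_in)
}
str(sim_weights())
```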
In the proposed framework, the observations are assumed to be sampled from a mixture with an inner and an outer component. As kernel of the mixture, we assume the hurdle model in (2.1), which distinguishes between the probabilities of being non-zero $\mathbf{p}_i$ and the parameters $\boldsymbol{\mu}_i$ of the sampling distribution, with inner components nested within the $m$th outer mixture component. Letting $\psi_{msj} = (p_{mj}, r_{msj}, \theta_{msj})$ and $\psi_{ms} = (\psi_{ms1}, \dots, \psi_{msd})$, the mixture model is as follows:
$$ f(\mathbf{y}_i) = \sum_{m=1}^{M} w_m \sum_{s=1}^{S_m} w_{s \mid m} \prod_{j=1}^{d} f(y_{ij} \mid \psi_{msj}), \tag{2.3} $$
where the kernel $f(y_{ij} \mid \psi_{msj})$ is defined via conditionally independent hurdle models as in (2.1). The outer mixture is a mixture of multivariate Bernoulli distributions and coincides with the widely used latent class model [34]. Moreover, being conditionally independent of the actual values of the non-zero observations, it offers further computational advantages, as shown in §3. Model (2.3) induces a partition of the subject indices $\{1, \dots, n\}$ at an outer and an inner level. Let $c_i$ and $z_i$, for $i = 1, \dots, n$, denote the allocation variables indicating the component of the mixture to which each subject is assigned at the outer and inner level, respectively. When two subjects, $i$ and $l$, are assigned to the same component of the outer-level mixture, the probabilities of observing a zero for the two subjects are the same, $\mathbf{p}_i = \mathbf{p}_l$, and the two subjects are assigned to the same cluster, i.e. $c_i = c_l$. Moreover, if the two subjects are also assigned to the same component of the inner-level mixture, we have $z_i = z_l$ and $\boldsymbol{\mu}_i = \boldsymbol{\mu}_l$ (with, obviously, $c_i = c_l$). However, the vectors of parameters $\boldsymbol{\mu}_i$ and $\boldsymbol{\mu}_l$ characterizing the sampling distribution might differ even when $c_i = c_l$ and, consequently, the two subjects might be assigned to different clusters at the inner level. This is reflected in the vectors of parameters $(\mathbf{p}_i, \boldsymbol{\mu}_i)$ and $(\mathbf{p}_l, \boldsymbol{\mu}_l)$, which might share only the component corresponding to the probability of zero outcomes, or both components.
Using allocation variables, the conditional dependence structure between the outer and inner levels is the following:
$$ \text{Outer mixture:}\quad \Pr(c_i = m \mid w_1, \dots, w_M) = w_m, \qquad p_{mj} \overset{\text{iid}}{\sim} \text{Beta}(\alpha, \beta), \tag{2.5} $$
$$ \text{Inner mixture:}\quad \Pr(z_i = s \mid c_i = m, w_{1 \mid m}, \dots, w_{S_m \mid m}) = w_{s \mid m}, \qquad \theta_{msj} \overset{\text{iid}}{\sim} \text{Beta}(\eta, \lambda), \tag{2.6} $$
where the weights $w_m$ and $w_{s \mid m}$ are obtained by normalizing unnormalized weights with independent Gamma distributions and where, as before, we denote with $p_m$, $r_{ms}$ and $\theta_{ms}$ the component-specific parameters, which are assumed a priori independent; Gamma$(\alpha, \beta)$ denotes the Gamma distribution with mean $\alpha/\beta$ and variance $\alpha/\beta^2$. The choice of the Gamma distribution for the unnormalized weights of the mixture leads to the standard Dirichlet distribution for the normalized weights. In this setting, the computations are greatly simplified by the introduction of a latent variable, conditionally on which the unnormalized weights are independent; see Argiento & De Iorio [19] for details. Note that the inner mixture is here defined conditionally on the probabilities $p_{mj}$ of non-zero counts and not on $\tilde{\mathbf{Y}}_i$. Thus, while conditioning on $p_{mj}$, $\mathbf{Y}_i$ is still allowed to present zero entries. Finally, we highlight that representations (2.3) and (2.5)–(2.6) are equivalent.
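The normalization fact invoked here is standard:
$$ \tilde{w}_m \overset{\text{iid}}{\sim} \text{Gamma}(\gamma, 1), \quad w_m = \frac{\tilde{w}_m}{\sum_{l=1}^{M} \tilde{w}_l} \;\Longrightarrow\; (w_1, \dots, w_M) \sim \text{Dirichlet}(\gamma, \dots, \gamma), $$
with the normalized weights independent of the total $\sum_{l} \tilde{w}_l$, which is what makes the latent-variable augmentation mentioned above effective.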

Inference
Posterior inference can be performed through both a conditional and a marginal algorithm, derived by extending the algorithms of Argiento & De Iorio [19] to the enriched set-up. The conditional algorithm is described in algorithm 1, while in algorithm 2 we present the marginal one. The conditional algorithm is very flexible and allows for different prior distributions on the weights of the two mixtures, as well as on $M$ and $S_m$ (see [19] for details). In algorithm 2, we use the notation $q_M$ and $q_S$ to denote the priors on $M$ and $S_m$, respectively, and we set them both equal to a shifted Poisson for the application in §4. Furthermore, $h_{\text{out}}$ and $h_{\text{in}}$ denote the prior distributions on the unnormalized weights (in our case, Gamma distributions) of the outer and inner mixtures, respectively, while $\psi_{\text{out}}(u)$ and $\psi_{\text{in}}(u)$ denote the corresponding Laplace transforms of $h_{\text{out}}$ and $h_{\text{in}}$ (in our case, $\psi_{\text{out}}(u) = (1 + u)^{-\gamma_M}$ and $\psi_{\text{in}}(u) = (1 + u)^{-\gamma_S}$).
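As a quick check of the stated transforms, for a unit-rate Gamma prior with shape $\gamma$ on an unnormalized weight $\tilde{w}$:
$$ \psi(u) = \mathbb{E}\big[e^{-u \tilde{w}}\big] = \int_0^\infty e^{-uw}\, \frac{w^{\gamma - 1} e^{-w}}{\Gamma(\gamma)}\, \mathrm{d}w = (1 + u)^{-\gamma}, $$
which gives the forms above with $\gamma = \gamma_M$ and $\gamma = \gamma_S$ for the outer and inner mixtures, respectively.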
To implement the marginal algorithm, we need to derive the marginal likelihood of the data, conditionally on cluster membership. The likelihood in equation (2.3) can be written as
$$ \prod_{i=1}^{n} \prod_{j=1}^{d} f(y_{ij} \mid \psi_{c_i z_i j}) = \prod_{i=1}^{n} \prod_{j=1}^{d} (1 - p_{c_i j})^{\mathbb{1}(y_{ij} = 0)}\, p_{c_i j}^{\mathbb{1}(y_{ij} > 0)} \prod_{i, j \,:\, y_{ij} > 0} g(y_{ij} \mid r_{c_i z_i j}, \theta_{c_i z_i j}). $$
Recall that $c_i$ and $z_i$ denote the labels of the clusters to which the $i$th subject belongs in the outer and the inner clustering, respectively. The marginal likelihood of the data conditionally on the cluster allocation is obtained by marginalizing with respect to the prior distributions defined in (2.5) and (2.6). For a vector of counts $\mathbf{y}$ with $n_0$ zero and $n_+$ positive entries, the Beta priors can be integrated in closed form, yielding
$$ m_{\text{out}}(\mathbf{y}) = \int_0^1 p^{n_+} (1 - p)^{n_0}\, \frac{p^{\alpha - 1} (1 - p)^{\beta - 1}}{B(\alpha, \beta)}\, \mathrm{d}p = \frac{B(\alpha + n_+, \beta + n_0)}{B(\alpha, \beta)} $$
and, conditionally on $r$,
$$ m_{\text{in}}(\mathbf{y} \mid r) = \Bigg[ \prod_{i \,:\, y_i > 0} \binom{y_i + r - 2}{y_i - 1} \Bigg] \frac{B\big(\eta + n_+ r, \; \lambda + \sum_{i : y_i > 0} (y_i - 1)\big)}{B(\eta, \lambda)}, $$
with the discrete parameter $r$ marginalized by summing over its prior. When implementing the marginal algorithm, after updating the latent variables $\bar{U}$ and $U_m$, we could add an extra step involving a shuffle of the nested partition structure, as suggested by Wade et al. [25], to improve mixing. Note also that, when a subject $i$ is assigned to a new outer cluster, the full conditional distribution of $z_i$ is degenerate and a new auxiliary variable $U_m$ has to be sampled before moving to the next subject $i + 1$. More details and an empirical comparison of the two algorithms are provided in Section S3 of the electronic supplementary material.
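A sketch of these quantities in R, under the reconstruction above (our illustration; the shifted-Poisson prior on $r$ and its truncation point are assumptions):

```r
# Log marginal likelihood of one cluster's counts for a single process,
# integrating p over Beta(alpha, beta).
log_marg_out <- function(y, alpha, beta) {
  n_pos <- sum(y > 0); n_zero <- sum(y == 0)
  lbeta(alpha + n_pos, beta + n_zero) - lbeta(alpha, beta)
}

# Positive part: theta integrated over Beta(eta, lambda), r summed over
# a truncated shifted-Poisson prior.
log_marg_in <- function(y, eta, lambda, pois_mean = 1, r_max = 50) {
  y <- y[y > 0]; n_pos <- length(y)
  log_terms <- sapply(1:r_max, function(r) {
    dpois(r - 1, pois_mean, log = TRUE) +        # shifted-Poisson prior on r
      sum(lchoose(y + r - 2, y - 1)) +
      lbeta(eta + n_pos * r, lambda + sum(y - 1)) - lbeta(eta, lambda)
  })
  m <- max(log_terms)                            # log-sum-exp over r
  m + log(sum(exp(log_terms - m)))
}

y <- c(0, 3, 0, 1, 7, 0, 2)
log_marg_out(y, alpha = 1, beta = 1)
log_marg_in(y, eta = 1, lambda = 1)
```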

Application to WhatsApp use during COVID-19

(a) Data description and preprocessing
We apply our model to a dataset on WhatsApp use during COVID-19 [31]. The data consist of a questionnaire filled out by participants living in India. Each subject answers the same d = 7 questions for T = 7 consecutive days on the number of (j = 1) COVID-19 messages forwarded, (j = 2) WhatsApp groups to which COVID-19 messages were forwarded, (j = 3) people to whom COVID-19 messages were forwarded, (j = 4) unique forwarded messages received in personal chats, (j = 5) people from whom forwarded messages were received, (j = 6) personal chats that discussed COVID-19, (j = 7) WhatsApp groups that mentioned COVID-19. Table S1 in the electronic supplementary material provides the list of the questions, as well as a brief description. In what follows, the first replicate (t = 1) corresponds to Sunday for all subjects, t = 2 to Monday, up to T = 7 corresponding to Saturday. The questionnaire responses were collected in June and July 2021, during India's infection wave of the Delta variant of the SARS-CoV-2 virus that causes coronavirus disease 2019 (COVID-19).
From the initial 1156 respondents, we remove two subjects for whom no answers are available, resulting in a final sample size of n = 1154. Moreover, 19% of the observations are missing. We also treat counts higher than 400, which are very rare (seven observations out of 56 546), as missing data, since they are far outside the range of the majority of the data. We handle missing data using a two-step procedure. Firstly, whenever possible, we recover missing zeros using deterministic imputation based on respondents' answers to other sections of the questionnaire. For instance, if the answer to the question 'Did you send any message of this kind today?' is 'No' and there is a missing value for the question 'How many?', we can reasonably assume that the answer to the latter question is zero. In this way, we recover 0.5% of the missing observations. Secondly, the remaining missing values are imputed using random forest imputation (as implemented in the R package mice [35]). In Section S2 of the electronic supplementary material, we provide more details on the data imputation technique and present an empirical study quantifying the impact of data imputation on the results presented in the next section. Figure S2 of the electronic supplementary material displays the data after imputation.
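A minimal sketch of this two-step procedure (illustrative only: the data frame and column names are hypothetical; mice's random forest method is selected via method = "rf"):

```r
# Assumes a data frame `whatsapp` with one count column per question/day
# and a companion yes/no screening answer `sent_any` (hypothetical names).
library(mice)

# Step 1: deterministic zeros -- if the screening answer is "No",
# a missing count is logically zero.
whatsapp$n_forwarded[is.na(whatsapp$n_forwarded) &
                     whatsapp$sent_any == "No"] <- 0

# Step 2: random forest imputation for the remaining missing values.
imp <- mice(whatsapp, method = "rf", m = 1, seed = 1, printFlag = FALSE)
whatsapp_complete <- complete(imp)
```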
To account for the fact that $T$ repeated observations are available for each subject and process, we need to slightly modify model (2.3). We do so by assuming that the different time points are independent of each other, so that repeated observations can be straightforwardly included in the proposed model. Let $Y_{ijt}$ denote the count for the $i$th subject and the $j$th process at time $t$, $i = 1, \dots, n$, $j = 1, \dots, d$ and $t = 1, \dots, T$. We assume that the $Y_{ijt}$ are conditionally independent, given the parameters of the model. Thus, the likelihood contribution of each subject $i$ is given by $\prod_{t=1}^{T} \prod_{j=1}^{d} f(y_{ijt} \mid \psi_{msj})$. We highlight that we are clustering individuals based on the pattern of all their observations, at each time point $t$ and for each process $j$.
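Under this independence assumption, the subject-level log-likelihood is a sum over days and processes; a small R sketch (reusing dhurdle_snb() from the earlier sketch, with component-specific parameter vectors as assumed inputs):

```r
# Log-likelihood contribution of one subject across T days and d processes.
# `y` is a T x d matrix; p, r, theta are length-d vectors for the subject's
# component.
subject_loglik <- function(y, p, r, theta) {
  sum(sapply(seq_len(ncol(y)), function(j)
    sum(dhurdle_snb(y[, j], p[j], r[j], theta[j], log = TRUE))))
}
```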
Finally, we note that, thanks to the probabilistic structure of the hurdle model for zero-inflated data, $p_i$ and the sampling distribution $g(\cdot \mid \mu_i)$ reflect two distinct features of the respondents' behaviour: $p_i$ represents the probability of engaging in some COVID-19-related WhatsApp activity, while $g(\cdot \mid \mu_i)$ captures the behaviour of those subjects who have actually engaged in the activity.

(b) Results
Posterior inference is performed through the conditional algorithm described in algorithm 1. We run the algorithm for 15 000 MCMC iterations, discarding the first 5000 as burn-in. Figure 2 shows that, at the outer level, the posterior distributions of both the number of components and the number of clusters have a mode at three.
As a point estimate of the cluster allocation, we report the configuration that minimizes the posterior expectation of Binder's loss function [36] under equal misclassification costs, a common choice in the applied Bayesian non-parametrics literature [37]. Briefly, this expected loss measures, over all possible pairs of subjects, the discrepancy between the posterior probability of co-clustering and the co-clustering induced by the estimated allocation. We refer to the resulting cluster allocation as the Binder estimate.

The Binder estimate of the outer clustering contains three clusters, whose characteristics are summarized in figures 3 and 4. The largest cluster corresponds to WhatsApp users who on most days report a zero count for all d = 7 questions. The individuals in the other two clusters use WhatsApp more frequently when it comes to forwarding COVID-19 messages (j = 1, 2), receiving forwarded messages (j = 3, 4, 5) and having COVID-19 mentioned in their WhatsApp groups (j = 7). The main feature distinguishing Cluster 2 from Cluster 3 in terms of the probabilities $p_i$ of non-zero counts is that, on most days, Cluster 2, unlike Cluster 3, also discusses COVID-19 in personal chats (question j = 6).

Figures 5 and 6 display the main characteristics of the inner clusters. We are interested in the posterior distribution of the number of inner clusters per outer cluster, as well as the inner clustering within each outer cluster. To this end, we run the MCMC algorithm fixing the outer cluster allocation to its Binder estimate, thus obtaining the conditional posterior distribution of the inner clustering. The results reveal substantial variability in the distribution of non-zero counts within outer Clusters 1 and 2 (see figure 5c). The majority of counts in outer Cluster 1 are zero, leaving little variation in the counts for the inner clustering. As most individuals present zero counts (for most processes) at an inner-cluster level, it becomes difficult to detect specific patterns, as is also evident from the many co-clustering probabilities falling in an intermediate range (see figure 6). Notably, around a quarter of the individuals in outer Cluster 2, as captured by its inner Cluster 2, forward COVID-19 messages to many more people (question j = 3) than subjects in inner Cluster 1 of outer Cluster 2. Figure 4 also supports the fact that outer Cluster 2 engages with WhatsApp in a much more persistent manner than the other outer clusters. These results highlight that a sizeable minority of WhatsApp users had a relatively large propensity to spread COVID-19 messages during a critical phase of the pandemic. This is in line with a similar survey in Singapore [38] and findings on 'superspreaders' on other social media.
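For readers wishing to reproduce this step, a minimal sketch of the Binder point estimate (our illustration, not the authors' implementation; as is common practice, the search is restricted to the sampled partitions):

```r
# Binder point estimate from MCMC cluster allocations (equal
# misclassification costs). `labels` is an (iterations x n) matrix of
# sampled allocations.
binder_estimate <- function(labels) {
  n <- ncol(labels)
  # posterior co-clustering (similarity) matrix
  sim <- matrix(0, n, n)
  for (it in seq_len(nrow(labels)))
    sim <- sim + outer(labels[it, ], labels[it, ], "==")
  sim <- sim / nrow(labels)
  # expected Binder loss of each sampled partition
  loss <- apply(labels, 1, function(z)
    sum(abs(outer(z, z, "==") - sim)))
  labels[which.min(loss), ]
}
```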

Conclusion
In this work, we propose a Bayesian model for multiple zero-inflated count data, building on the well-established hurdle model and exploiting the flexibility of finite mixture models with a random number of components. The main contribution of this work is the construction of an enriched finite mixture with a random number of components, which allows for a two-level (nested) clustering of the subjects based on their pattern of counts across different processes. This structure enhances the interpretability of the results and has the potential to better capture important features of the data. We design a conditional and a marginal MCMC sampling scheme to perform posterior inference. The proposed methodology has wide applicability, since excess-of-zeros count data arise in many fields. Our motivating application involves answers to a questionnaire on the use of WhatsApp in India during the COVID-19 pandemic. Our analysis identifies a two-level clustering of the subjects: the outer cluster allocation reflects daily probabilities of engaging in different WhatsApp activities, while the inner level informs on the number of messages, conditionally on the fact that the subject is indeed receiving/sending messages on WhatsApp. Any two subjects are clustered together if they show a similar pattern across the multiple responses. We find three well-distinguished respondent behaviours corresponding to the three outer clusters: (i) subjects with low probability of daily utilization; (ii) subjects with high probability of sending/receiving all types of messages; and (iii) subjects with high probability for all considered messages except for non-forwarded messages in personal chats. Interestingly, the inner-level clustering and the outer-cluster-specific estimates of the sampling distribution $g$ highlight similarities between outer Clusters 1 and 3, where subjects tend to send/receive fewer messages compared with outer Cluster 2. Moreover, we are able to identify those subjects with a high propensity to spread COVID-19 messages during the critical phase of the pandemic, and for these subjects we do not find notable differences in terms of the types of messages sent or received. Our results are in line with the existing literature on the topic. Future work involves the development of more complex clustering hierarchies and of techniques able to identify the processes that most inform the clustering structure.
Data accessibility. Due to ethical and regulatory constraints on sharing human subject data, the dataset is not available.
Authors' contributions. B.F.: methodology; A.C.: methodology; W.v.d.B.: data curation, writing-original draft; M.D.I.: conceptualization, writing-review and editing. All authors gave final approval for publication and agreed to be held accountable for the work performed therein.